Introduction

Parallel computing has become an essential technology in modern computing, driven by constantly increasing demands for higher performance, lower cost, and sustained productivity in real applications. Concurrent events are commonplace in today's high-performance computers because of practices such as multiprogramming, multiprocessing, and multicomputing.


Parallelism can take the form of lookahead, pipelining, vectorization, concurrency, simultaneity, data parallelism, partitioning, interleaving, overlapping, multiplicity, replication, time sharing, space sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.

Parallel computing is a type of computation in which many calculations are performed at the same time, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.

Types of Classification

The following classifications of parallel computers have been identified:

🔄

Classification based on the instruction and data streams

Focuses on how instructions and data flow through the system

🏗️

Classification based on the structure of computers

Examines the physical organization and architecture

💾

Classification based on how the memory is accessed

Analyzes memory access patterns and organization

📏

Classification based on grain size

Considers the size of data chunks processed in parallel

Basic Types of Architectural Classification

This categorization of computers was first proposed by Michael J. Flynn in 1966 and later refined in 1972. Flynn did not base his classification on machine structure; instead, he used the concepts of instruction and data streams to categorize computers. Not all of the computers in Flynn's classification are parallel, but understanding all of its classes is essential background for studying parallel computers.


Because this classification relies on instruction and data streams, we first need to understand how the instruction cycle works.

🔄Instruction Cycle

The instruction cycle is the sequence of steps needed to execute an instruction in a program. A typical instruction is composed of two parts: the opcode and the operand. The operand part specifies the data on which the specified operation is to be performed.

The operand part is itself divided into two fields: the addressing mode and the operand. The addressing mode specifies how the address of the actual data is determined, and the operand field is used as an argument by that method when computing the actual address.

The control unit of the central processing unit (CPU) fetches instructions from the program sequentially, one at a time. The fetched instruction is interpreted by the decoder, which is part of the control unit, and the processor then executes the decoded instruction. The result of the execution is temporarily stored in the Memory Buffer Register (MBR), also known as the Memory Data Register.
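
To make the cycle concrete, the following toy simulator walks through fetch, decode, and execute. The two-field instruction format, the opcode names, and the tiny instruction set are invented for illustration only and do not correspond to any real CPU.

```python
# A toy fetch-decode-execute loop. The (opcode, operand) encoding and
# the instruction set are invented for illustration; real CPUs are far
# more elaborate (addressing modes, interrupts, pipelining, ...).

memory = [            # program: (opcode, operand) pairs
    ("LOAD", 7),      # ACC <- 7
    ("ADD", 5),       # ACC <- ACC + 5
    ("STORE", 0),     # data[0] <- ACC (result passes through the MBR)
    ("HALT", 0),
]
data = [0] * 4        # separate data memory
acc = 0               # accumulator register
pc = 0                # program counter
mbr = None            # memory buffer register: holds the last value moved

while True:
    opcode, operand = memory[pc]   # fetch: control unit reads one instruction
    pc += 1
    # decode + execute: the decoder selects an action based on the opcode
    if opcode == "LOAD":
        mbr = operand
        acc = mbr
    elif opcode == "ADD":
        mbr = operand
        acc += mbr
    elif opcode == "STORE":
        mbr = acc                  # result is buffered in the MBR
        data[operand] = mbr
    elif opcode == "HALT":
        break

print(data[0])  # 12
```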

📊Instruction Stream and Data Stream

The term 'stream' refers to a sequence or flow of either instructions or data operated on by the computer. Over the complete cycle of instruction execution, a flow of instructions from main memory to the CPU is established; this flow of instructions is called the instruction stream. Similarly, operands flow bidirectionally between the processor and memory; this flow of operands is called the data stream.

Flynn's Taxonomy of Computer Architecture

Parallel computing is a form of computation where tasks are divided into separate pieces that can be worked on at the same time. Each part is further split into a sequence of instructions. The instructions from each part are executed simultaneously on different CPUs.


Parallel systems involve the concurrent use of multiple computer resources, which can include a single computer with multiple processors, several computers linked by a network to create a parallel processing cluster, or a mix of both. Parallel systems are harder to program than single-processor computers because the architecture varies based on the parallel computer and the processes across multiple CPUs need to be coordinated and synchronized.

CPUs are at the core of parallel processing. Based on the number of instruction and data streams that can be handled at the same time, computing systems are categorized into four major types:

📝
SISD

Single-Instruction Single-Data Stream

📊
SIMD

Single-Instruction Multiple-Data Stream

🔄
MISD

Multiple-Instruction Single-Data Stream

📊🔄
MIMD

Multiple-Instruction Multiple-Data Stream

📝SISD (Single Instruction Single Data Stream)

Traditional sequential computers are classified as SISD (single instruction stream over a single data stream) machines. Instructions in these computers are carried out one after another, but their execution phases may overlap (pipelining).
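
As a point of reference for the models that follow, a plain sequential loop is already an SISD computation: one instruction stream operates on one data stream, element by element. This is a conceptual sketch only, not tied to any particular machine.

```python
# SISD: a single instruction stream processes a single data stream.
data = [1, 2, 3, 4]
result = []
for x in data:           # instructions execute one after another
    result.append(x * x)
print(result)            # [1, 4, 9, 16]
```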

📊SIMD (Single Instruction Multiple Data Stream)

This class includes computers that have vector/array processing capabilities as well as scalar hardware. Multiple processing elements are supervised by a single control unit. All processing elements receive the same instruction from the control unit but operate on different data sets drawn from separate data streams.
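
A convenient software analogy is NumPy's element-wise array arithmetic, where one logical operation is applied across many data elements without an explicit per-element loop (and, on typical builds, is backed by the CPU's vector units). A minimal sketch, assuming NumPy is available:

```python
import numpy as np

a = np.arange(8)      # one data stream: [0, 1, ..., 7]
b = np.full(8, 10)    # another data stream: [10, 10, ..., 10]

# A single "add" operation is broadcast over all eight element pairs,
# mirroring one control unit driving many processing elements.
c = a + b
print(c)              # [10 11 12 13 14 15 16 17]
```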

🔄MISD (Multiple Instruction Single Data Stream)

The same data stream flows through a linear array of processors that execute different instruction streams. Systolic arrays, which pipeline the execution of specific algorithms, are an example of this architecture.
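
A rough software sketch of the idea: the same stream of values flows through a linear chain of stages, each stage executing a different operation. The stage functions here are invented for the example.

```python
# MISD analogy: one data stream, several different instruction streams.
# Each "processor" in the linear array applies a different operation
# to the same stream of values as it flows past.
stages = [
    lambda x: x + 1,     # processor 1: increment
    lambda x: x * 2,     # processor 2: double
    lambda x: x - 3,     # processor 3: subtract 3
]

stream = [10, 20, 30]
for value in stream:
    for stage in stages:     # the value flows through the array
        value = stage(value)
    print(value)             # 19, 39, 59
```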

📊🔄MIMD (Multiple Instruction Multiple Data Stream)

This is the most popular model, and the term parallel computer is often reserved for MIMD machines. Of the four machine models, most parallel computers built in the past were based on the MIMD model and intended for general-purpose computing. The SIMD and MISD models are better suited to special-purpose computations. As a result, MIMD is the most widely used model, followed by SIMD, while MISD is the least common model implemented in commercial machines.
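
A minimal MIMD sketch using Python's standard concurrent.futures module: independent workers execute different instruction streams on different data streams concurrently. The two task functions are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor

# Two different instruction streams ...
def sum_of_squares(xs):
    return sum(x * x for x in xs)

def count_evens(xs):
    return sum(1 for x in xs if x % 2 == 0)

if __name__ == "__main__":
    # ... applied to two different data streams, concurrently.
    with ProcessPoolExecutor() as pool:
        f1 = pool.submit(sum_of_squares, range(1000))
        f2 = pool.submit(count_evens, range(2000))
        print(f1.result(), f2.result())
```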

Feng's Classification

Tse-yun Feng suggested using the degree of parallelism to classify various computer architectures.

  • The maximum number of binary digits that can be processed per unit time by a computer system is called its maximum parallelism degree P (see the short sketch after this list).
  • A bit slice is a string of bits, one taken from each word at the same vertical (bit) position.
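
As a small sketch, assuming Feng's usual convention that a machine with an n-bit word length that can process m words simultaneously has maximum parallelism degree P = n × m (the (n, m) pairs below come from the table further down):

```python
# Maximum parallelism degree P = n * m, where n is the word length in
# bits and m is the number of words (bit slices) processed at once.
machines = {
    "STARAN":         (1, 256),   # WPBS: word-parallel, bit-serial
    "IBM 370/168 UP": (64, 1),    # WSBP: word-serial, bit-parallel
    "Illiac IV":      (64, 64),   # WPBP: fully parallel
}

for name, (n, m) in machines.items():
    print(f"{name}: P = {n} * {m} = {n * m} bits per unit time")
```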

Under the above classification:

🔢

Word Serial and Bit Serial (WSBS)

One bit is processed at a time

🔢

Word Parallel and Bit Serial (WPBS)

m-bit slice is processed at a time

🔢

Word Serial and Bit Parallel (WSBP)

One word of n bits is processed at a time

🔢

Word Parallel and Bit Parallel (WPBP)

An array of n x m bits is processed at one time

📊Feng's Classification Table

Mode   Computer Model   Degree of Parallelism (n, m)             Example
WSBS   n = 1, m = 1     (1, 1)                                   The "MINIMA"
WPBS   n = 1, m > 1     (1, 256), (1, 16384), (1, 4096)          STARAN, MPP, DAP
WSBP   n > 1, m = 1     (64, 1), (60, 1), (48, 1), (16/32, 1)    IBM 370/168 UP, CDC 6600, Burroughs B7700, VAX 11/780
WPBP   n > 1, m > 1     (64, 64)                                 Illiac IV

Handler's Classification

In 1977, Wolfgang Handler proposed an elaborate notation for expressing the pipelining and parallelism of computers. Handler's classification addresses the computer at three distinct levels:

  • Processor control unit (PCU)
  • Arithmetic logic unit (ALU)
  • Bit-level circuit (BLC)

The PCU corresponds to a processor or CPU, the ALU corresponds to a functional unit or a processing element, and the BLC corresponds to the logic circuitry needed to perform one-bit operations in the ALU.

📝Handler's Notation

Handler's classification uses the following three pairs of integers to describe a computer:

Computer = (p * p', a * a', b * b')

Where:

  • p = number of PCUs
  • p' = number of PCUs that can be pipelined
  • a = number of ALUs controlled by each PCU
  • a' = number of ALUs that can be pipelined
  • b = number of bits in ALU or processing element (PE) word
  • b' = number of pipeline segments on all ALUs or in a single PE
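
As a small sketch of how the three pairs compose into the notation (the class and helper names here are our own, not part of Handler's scheme):

```python
from dataclasses import dataclass

@dataclass
class HandlerTriple:
    """One operating mode in Handler's notation: (p * p', a * a', b * b')."""
    p: int    # number of PCUs
    pp: int   # PCUs that can be pipelined (p')
    a: int    # ALUs controlled by each PCU
    ap: int   # ALUs that can be pipelined (a')
    b: int    # bits per ALU or PE word
    bp: int   # pipeline segments in the ALUs (b')

    def __str__(self):
        def pair(x, xp):
            return str(x) if xp == 1 else f"{x} * {xp}"
        return f"({pair(self.p, self.pp)}, {pair(self.a, self.ap)}, {pair(self.b, self.bp)})"

# Texas Instruments ASC: one controller, four arithmetic units,
# each an eight-stage pipeline on 64-bit words.
print(HandlerTriple(1, 1, 4, 1, 64, 8))    # (1, 4, 64 * 8)

# CDC 6600 central processor: one PCU, one ALU built from 10
# pipelined functional units, 60-bit words.
print(HandlerTriple(1, 1, 1, 10, 60, 1))   # (1, 1 * 10, 60)
```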

🔧

Operators and Rules

The following operators are used to show the relationships between the various elements of the computer:

  • '*' combines separate pieces of hardware that operate sequentially as a pipeline
  • '+' combines separate pieces of hardware that operate in parallel
  • 'v' combines the alternative operating modes of a single piece of hardware
  • '~' indicates a range of values for a parameter

💻Handler's Classification Examples

CDC 6600

The CDC 6600 has a single central processor supported by 10 I/O processors. One control unit coordinates one ALU with a 60-bit word length. The ALU has 10 functional units that can be arranged into a pipeline. The 10 peripheral I/O processors can operate concurrently with one another and with the CPU. Each I/O processor contains one 12-bit ALU.

CDC 6600 = (I/O processors) * (central processor) = (10, 1, 12) * (1, 1 * 10, 60)
Texas Instrument's Advanced Scientific Computer (ASC)

ASC has one controller coordinating four arithmetic units. Each arithmetic unit is an eight-stage pipeline operating on 64-bit words.

ASC = (1, 4, 64 * 8)
Cray-1

Cray-1 is a 64-bit single-processor computer whose ALU has twelve functional units, eight of which can be chained together to form a pipeline. Individual functional units have from 1 to 14 segments, which can also be pipelined.

Cray-1 = (1, 12 * 8, 64 * (1 ~ 14))
C.mmp multiprocessor

The C.mmp was designed to facilitate research into parallel computer architectures, so it can be extensively reconfigured. The system consists of 16 PDP-11 minicomputers with 16-bit word lengths, interconnected by a crossbar switching network.

C.mmp = (16, 1, 16) v (1, 16, 16) v (1 * 16, 1, 16)

The '*' and '+' operators both combine multiple separate pieces of hardware: '*' when the units are pipelined and '+' when they operate in parallel. The 'v' operator differs from the other two in that it combines the alternative operating modes of a single piece of hardware.


While Flynn's classification is straightforward, Handler's is unwieldy. The direct use of numbers in Handler's nomenclature makes it more abstract and therefore harder to read. Handler's classification is strongly oriented toward describing pipelines and chains; although it can adequately describe the parallelism within a single processor, it does not capture the variety of parallelism found in multiprocessor computers well.